ABSTRACT


Open-ended learning environments (OELEs) allow students 
to freely interact with the content and to discover impor- 
tant principles and concepts of the learning domain on their 
own. However, only some students possess the necessary 
skills for efficient and effective exploration. Guidance in the 
form of targeted interventions or feedback therefore has the 
potential to improve educational outcomes. A promising 
approach for adaptation in OELEs is the design of inter- 
ventions based on the detection of characteristic learning 
behaviors through offline clustering, followed by a real-time 
classification of new students. In this paper, we explore the 
possibility of using recurrent neural network (RNN) models 
for this online classification task. We extensively evaluate 
the predictive performance of different variants of RNNs, 
namely long short-term memory models and gated recurrent
units, and different architectures on a data set collected
from an exploration-based educational game. We also com- 
pare the prediction accuracy of the different RNN models to 
the performance of traditional classifiers on the same data 
set. Our results demonstrate that RNNs perform similarly to or
better than traditional methods regarding early classification
and therefore constitute a promising alternative for the
online classification of new students. 
1. INTRODUCTION


Over the last years, there has been a rise in OELEs such 
as educational games or simulations. These environments 
allow students to freely interact with the content and (ide- 
ally) infer the concepts and principles of the learning domain 
through their exploration. However, previous research [20, 
31, 24] has demonstrated that few students possess the prob- 
lem solving and inquiry skills necessary to efficiently and ef- 
fectively explore the space. Individualized guidance in the 
form of adaptive interventions or feedback therefore has the
potential to improve students’ exploration strategies and at 
the same time optimize the educational outcomes. 
Traditionally, adaptation in computer-based learning envi- 
ronments has been based on the predictions of the student 
model. A large body of work has focused on developing stu- 
dent models that are able to accurately represent student 
knowledge. One of the most popular student modeling ap- 
proaches is Bayesian Knowledge Tracing (BKT) [8], a tech- 
nique that has been constantly refined and improved over 
the years, e.g., [34, 35]. Other widely used approaches are 
based on item response theory, such as the Additive Factors 
Model [5, 6] and Performance Factors Analysis [28]. Further- 
more, dynamic Bayesian networks, e.g., [13, 18] have been 
used to model student knowledge. All of these approaches 
are based on the assumption that the knowledge of the stu- 
dent can be represented through a set of skills (knowledge 
components) and that we can infer the knowledge about 
a specific skill based on students’ answers to tasks associ- 
ated with this skill. OELEs do not fulfill these criteria as 
they (usually) do not provide specific sequences of tasks or 
explicitly define knowledge components. Therefore, the in- 
troduced student modeling techniques cannot be (directly) 
applied to such environments. 

A prominent idea in the literature is to provide adapta- 
tion based on detected (and analyzed) learning behaviors. 
This idea has been formalized into a user modeling frame- 
work [15]: First, offline clustering is used to identify different 
types of student behaviors. The adaptive components of the 
environment are then designed with respect to the different 
behaviors found. Second, an online classification algorithm 
assigns new students to one of the clusters (and the corre- 
sponding intervention) in real time. A large amount of previ- 
ous research has focused on the offline clustering part of the 
framework, applying clustering approaches to identify differ- 
ent types of learners [25, 3, 10]. [12] have represented student 
activity patterns in massive open online courses (MOOCs) 
using behavior state-transition graphs and demonstrated that 
the extracted patterns can be interpreted. Other work as- 
sessed students’ problem solving behaviors in a game-based 
learning environment [32]. The full framework has been suc- 
cessfully applied to an environment for learning common ar- 
tificial intelligence algorithms [15]. Other researchers [16] 
have used the framework to predict the mathematical learn- 
ing patterns of students. Recently, the framework has been 
used to build student models for a more complex simulation 
of electric circuits [11]. To summarize, the presented research
has mostly focused on offline clustering or the application of
the framework to different learning domains, performing online
classification using standard algorithms such as k-nearest
neighbor [2].

Recurrent neural networks (RNNs) have been successfully
used for a variety of sequence classification problems such 
as sentiment analysis, e.g., [23] and video, e.g., [9]. RNNs 
have also been used in the educational community, for exam- 
ple to model student knowledge [30], to provide personalized 
recommendations in MOOCs [27], or to improve sensor-free 
affect detection [4]. Furthermore, RNNs have also been em- 
ployed for classification problems. [26] suggested the use of 
long short-term memory (LSTM) models to classify learner
behavior from touchscreen data. Other work [1] used LSTMs 
for the classification of problem-solving behaviors. 


In this paper, we explore the use of RNNs for online clas- 
sification: we assume the offline clustering solution to be 
given and train different classifiers to predict the cluster la- 
bel of a new student early on during interaction with the 
OELE. We hypothesize that the ability of RNNs to handle 
sequences of arbitrary length, allowing them to accumulate 
the relevant information over each time step, may benefit 
the online classification task. We investigate different types 
of RNNs, varying the models along three dimensions: the 
internal node structure used (LSTMs and gated recurrent 
units (GRU)), the depth of the network (number of layers), 
and the number of nodes in the hidden layers. We also train 
the models to either predict the whole sequence, i.e., out- 
putting the cluster label at each time step, or only predict 
the cluster label at the end of the sequence. The former ap- 
proach has the advantage that the model is able to predict 
the cluster label at any point in time. The latter approach 
requires training different models to make predictions at spe- 
cific time points, but enables the models to optimize predic- 
tions for the given point in time. We extensively evaluate 
and compare the predictive performance of all the different 
RNN models on a data set collected from an OELE [17]. 
Our results demonstrate that RNNs trained to predict the 
cluster label at each time step reach a similar predictive 
performance in early classification as the RNNs trained to 
predict the cluster label at the end of the sequence. Fur- 
thermore, despite the smaller number of parameters, GRU 
models tend to achieve a classification accuracy similar to 
LSTM models. We also compare the RNN models to tra- 
ditional classifiers on the same data set. Our findings show 
that the RNN models perform similarly to or better than the
traditional approaches regarding the prediction of cluster labels
early in the game. Earlier prediction of cluster labels
allows targeted guidance to be provided sooner. We therefore
conclude that the use of RNNs for the online classification 
of student types is promising. 


2. DATASET 


The data set at hand was collected from a short interactive 
game aiming at assessing students’ exploration choices. 


Training Environment. TugLet is an interactive game 
designed to assess students’ exploration behavior. The topic 
of the game revolves around a tug-of-war. Players can choose 
between two modes (illustrated in Fig. 1): they can engage in 
inquiry by simulating tug-of-war set-ups (Explore) or they 
can try to predict the winning side of specific tug-of-war 
set-ups and receive right-wrong feedback (Challenge). The 


Figure 1: Snapshots of Explore (top) and Challenge modes (bottom).
In Explore mode, children can set up different tug-of-war
teams and observe the outcome. In Challenge mode, children
have to determine the winning side of specific tug-of-war set-ups.


Challenge mode consists of a maximum of eight problems 
ordered by increasing difficulty. The eight problems consist 
of one very easy question followed by two easy questions, 
two medium questions, and three difficult questions. If the 
student answers a problem incorrectly, (s)he is put back into 
Explore mode. The student is however free to choose to go 
back to the Challenge mode at any point in time. The game 
is over after players solve eight problems in a row correctly. 
The learning goal of the game, which is not revealed to the 
player, is to discover the mathematical principles underlying 
the tug-of-war. 


Data Set. The data set consists of log files of 229 students 
attending the 8th grade of two different middle schools. The 
total number of observations in this data set is 10’258. One 
observation corresponds to either one set-up tested in Ex- 
plore mode or one question answered in Challenge mode. 
The length $l$ of the observation sequences varies between
$l = 12$ and $l = 127$. All students in this data set managed
to pass the game. 


Clustering Solution. Previous work [17] has shown that 
the learning outcome (measured by an external posttest) is 


Proceedings of The 12th International Conference on Educational Data Mining (EDM 2019) 110 


not only influenced by students’ exploration choices (Explore 
vs. Challenge) but also by the quality of their inquiry strate- 
gies. It was furthermore shown [19] that students can be 
grouped into six different clusters based on these detected 
strategies. These clusters can be semantically interpreted:
cluster 1 captures students who systematically explore and 
try to understand the mathematical principles behind the 
tug-of-war. Cluster 3 consists of students who pass the game 
fast by only using Challenge mode. Students in cluster 6 also 
do not explore, but take a long time to pass the game. Stu- 
dents in cluster 4 on the other hand simulate many different 
tug-of-war configurations in Explore mode, without success. 
Cluster 2 lies in-between clusters 1 and 3, and exploration 
behaviors in cluster 5 are a mix between those in cluster 3 
and cluster 6. The clusters are not only correlated to per- 
formance in an external posttest, but also predict academic 
achievement more broadly [19]. 

The features serving as an input for the clustering are ex- 
tracted by level: children need to answer eight Challenge 
questions in a row correctly to pass the game and there- 
fore the game can be divided into eight levels. Level n is 
reached the first time the student answers exactly n ques- 
tions in a row correctly. The features used for clustering 
consist of the following cumulative counts extracted for level 
$n \in [1, 8]$: the number of Challenge problems $NC_n$ needed
to reach level $n$, the total number of tug-of-war set-ups $NE_n$
simulated in Explore mode before reaching level $n$, and the
number of tug-of-war set-ups $NSE_n$ simulated in Explore
mode which are classified as reflecting systematic inquiry
(see [17] for a definition of systematic inquiry) until reaching
level $n$. Therefore, the input features used for the clustering
are $NC = [NC_1, \ldots, NC_8]$, $NE = [NE_1, \ldots, NE_8]$,
and $NSE = [NSE_1, \ldots, NSE_8]$. The cluster solution is then
found by computing the pair-wise dissimilarities between all 
students for each feature using the Euclidean distance as a 
similarity measure and subsequently performing a pair-wise 
clustering [14]. The optimal number of clusters is deter- 
mined by the Bayesian Information Criterion (BIC) [29]. In 
the following, we will use the presented clustering solution as
ground truth for our classification task. 
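The level-based feature extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the event-log format (tuples of mode and flag), the function name `level_features`, and the rule that only a wrong answer resets the run of correct answers are all assumptions made for the example.

```python
def level_features(events, n_levels=8):
    """Cumulative counts (NC_n, NE_n, NSE_n) at the moment each level n is first reached.

    `events` uses a hypothetical log format: ("E", is_systematic) for one
    Explore set-up and ("C", is_correct) for one Challenge answer.
    """
    nc = ne = nse = 0   # running counts of Challenge answers, set-ups, systematic set-ups
    streak = 0          # current run of correct Challenge answers
    NC, NE, NSE = [], [], []
    for mode, flag in events:
        if mode == "E":
            ne += 1
            nse += int(flag)
        elif mode == "C":
            nc += 1
            # Assumption: only a wrong answer resets the run.
            streak = streak + 1 if flag else 0
            # Level n is reached the first time the run has length n.
            if streak == len(NC) + 1 and len(NC) < n_levels:
                NC.append(nc)
                NE.append(ne)
                NSE.append(nse)
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return NC, NE, NSE


# Hypothetical log: two set-ups, a correct and a wrong answer,
# one more set-up, then two correct answers in a row.
log = [("E", False), ("E", True), ("C", True), ("C", False),
       ("E", True), ("C", True), ("C", True)]
print(level_features(log))
```

A student who has not yet reached level $n$ simply has fewer than $n$ entries in each list, which is why the traditional classifiers can only be applied at fixed levels.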


3. ONLINE CLASSIFICATION OF NEW 
STUDENTS 


Ideally, the output of the clustering algorithm enables us to 
characterize different student behaviors and to design inter- 
ventions or feedback based on the detected behaviors. Ide- 
ally, we are also able to characterize the performance of new 
learners in real time in order to deliver a targeted interven- 
tion as soon as possible. The corresponding framework is 
illustrated in Fig. 2. In the case of our data set, students 
assigned to cluster 4 might for example get hints on how 
to explore systematically, while students from cluster 3 will 
be prompted to use Explore mode in order to figure out the 
principles governing the tug-of-war. 

RNN models are a family of neural network models able 
to handle sequences of arbitrary lengths. They are espe- 
cially suited for time-series data and are able to represent 
the relevant information over a sequence of time steps. We 
therefore adapt different types of RNNs for the online classification
task (marked in dark blue in Fig. 2). RNNs map
a sequence of input features $x_1, x_2, \ldots, x_T$ to a sequence of
output features $y_1, y_2, \ldots, y_T$. They maintain a sequence of
hidden states $h_1, h_2, \ldots, h_T$. These hidden states capture
relevant information from past observations which will influence
future predictions. Figure 3 shows an illustration of
a simple RNN model.

Figure 2: Students are clustered offline and targeted interventions
are designed based on the (semantic) cluster interpretation.
New students are then classified online. In this
paper, we focus on the online classification task (marked in
dark blue).

Figure 3: Simple RNN unrolled over $T$ time steps. $x$ denotes
the input feature vectors, $y$ denotes the output feature vectors,
and the hidden states are represented by $h$.


3.1 Long Short-Term Memory Classification
Long short-term memory (LSTM) models are a powerful
modification of the RNN architecture. 


Specification. LSTM models replace each hidden state $h_t$
by an LSTM cell unit with additional gating parameters. 
These parameters determine when to forget or retain previ- 
ous information. The update equations of an LSTM are as 
follows: 
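For reference, the standard LSTM updates can be written in the notation of the GRU equations in Section 3.2. This is a reconstruction under the assumption that the model follows the common formulation with forget, input, and output gates. The weight subscripts ($W_{fx}$, $W_{fh}$, and so on) and the equation numbers (1)-(6) are assumed naming, chosen to mirror Eqs. (7)-(10):

```latex
\begin{align}
f_t         &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) \tag{1}\\
i_t         &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) \tag{2}\\
\tilde{C}_t &= \tanh(W_{Cx} x_t + W_{Ch} h_{t-1} + b_C)  \tag{3}\\
C_t         &= f_t \times C_{t-1} + i_t \times \tilde{C}_t \tag{4}\\
o_t         &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) \tag{5}\\
h_t         &= o_t \times \tanh(C_t) \tag{6}
\end{align}
```

In this form, $\sigma$ is the logistic sigmoid and $\times$ denotes element-wise multiplication.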
Figure 4: [...] time step $t$. The output vector $\hat{y}_t$ can be interpreted as a
probability distribution over the different cluster labels.


Here, $f_t$, $i_t$, and $o_t$ represent the forget, input, and output
gates of the LSTM cell unit $C_t$. $\tilde{C}_t$ denotes an intermediate
candidate cell state. The different weight matrices are
described by $W$ and $b$ stands for the bias.


Modeling. For our task, the input vector $x_t$ encodes the
clustering features at each time step $t$. We input the counts
for each time step $t$, i.e., $x_t = [NC_t, NE_t, NSE_t]$. Let us
assume that a hypothetical student $m$ tested three different
set-ups, the first two being random trials and the third
one being systematic, followed by answering two Challenge
questions. The input features for this student over the five
described time steps are as follows: $x_{m,1} = [0, 1, 0]$,
$x_{m,2} = [0, 2, 0]$, $x_{m,3} = [0, 3, 1]$, $x_{m,4} = [1, 3, 1]$,
$x_{m,5} = [2, 3, 1]$.
Figure 4 details $T$ time steps of an example student.
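The encoding of the hypothetical student $m$ can be reproduced with a few lines of Python. The sketch below assumes an illustrative event vocabulary (`"explore_random"`, `"explore_systematic"`, `"challenge"`) and a hypothetical helper name `encode_time_steps`; neither comes from the original logging format.

```python
def encode_time_steps(events):
    """Map a list of observations to cumulative [NC_t, NE_t, NSE_t] vectors."""
    nc = ne = nse = 0
    xs = []
    for kind in events:
        if kind == "challenge":
            nc += 1            # one more Challenge question answered
        elif kind == "explore_systematic":
            ne += 1            # counts as a set-up ...
            nse += 1           # ... and as a systematic set-up
        elif kind == "explore_random":
            ne += 1
        else:
            raise ValueError(f"unknown observation: {kind!r}")
        xs.append([nc, ne, nse])
    return xs


# Student m: two random set-ups, one systematic set-up, two Challenge answers.
student_m = ["explore_random", "explore_random", "explore_systematic",
             "challenge", "challenge"]
for t, x in enumerate(encode_time_steps(student_m), start=1):
    print(t, x)
```

Because the counts are cumulative, each vector summarizes the whole interaction up to time step $t$, not just the latest observation.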

The output vectors $y_t$ represent the (predicted) cluster labels
of the students: the output layer of the model uses
the softmax function to normalize the vectors to sum to
1 such that the values within these output vectors can be
thought of as probabilities for the different cluster labels (see
Fig. 4). When training the model, we provide the cluster labels
found during clustering as ground truth. Note that we
use a one-hot encoding of the cluster labels, i.e., for a student
$m$ belonging to cluster $k = 2$, $y_m = [0, 0, 1, 0, 0, 0]$.
The chosen model predicts the cluster label of the student
at each time step $t$, augmenting the amount of available data
and increasing flexibility, as it allows the cluster label of a
specific child to be predicted at any point in time. We denote
this type of model as LSTM$_{\text{seq}}$.

Note that $y_{m,1} = y_{m,2} = \ldots = y_{m,T}$ for all students $m$,
because during clustering each student is assigned a fixed
cluster label based on the whole sequence. Given that the
cluster labels are fixed, we can also design the model to only
output the cluster label at the end, i.e., train the LSTM to
predict the cluster label at the end of a given input sequence.
Instead of computing the training loss over the whole sequence,
we calculate the loss only for the last output $\hat{y}_T$ of
the sequence (marked with red in Fig. 4). When using this
type of model, we have to train a separate model for each
prediction time point. In our case, we will train separate
models for each level. We call this model LSTM$_{\text{end}}$.

Stacking multiple LSTM layers (see hidden layers in Fig. 4)
is another possible variation of the architecture, i.e., the vector
$h_t$ of layer $n - 1$ serves as an input for layer $n$. Stacking
layers adds levels of abstraction of input observations over
time, for example enabling representation of the problem at
different time scales. We will denote LSTM models with $n$
hidden layers, where $n > 1$, as LSTM$_{\text{seq},n}$ or LSTM$_{\text{end},n}$.
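The difference between the two training objectives described in this section can be made concrete with a small, framework-free sketch. It assumes that the 'seq' loss is the mean categorical cross-entropy over all time steps (the exact reduction used in training is an assumption), while the 'end' loss uses only the last output. The function names and the logits are hypothetical.

```python
import math


def softmax(logits):
    """Normalize raw scores so that they sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Categorical cross-entropy for a one-hot target given as a class index."""
    return -math.log(probs[label])


def seq_loss(logits_per_step, label):
    """'seq' objective: the same cluster label is the target at every time step."""
    losses = [cross_entropy(softmax(l), label) for l in logits_per_step]
    return sum(losses) / len(losses)


def end_loss(logits_per_step, label):
    """'end' objective: only the last output of the sequence is penalized."""
    return cross_entropy(softmax(logits_per_step[-1]), label)


# Hypothetical outputs for six clusters over two time steps:
# uninformative at first, then leaning towards class index 2.
logits = [[0.0] * 6, [0.0, 0.0, 2.0, 0.0, 0.0, 0.0]]
print(seq_loss(logits, label=2), end_loss(logits, label=2))
```

In this toy case the 'seq' loss is larger because it also penalizes the uninformative early prediction, which is the pressure that pushes a sequence-trained model towards early classification.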


3.2 Gated Recurrent Unit Classification 


Gated recurrent unit (GRU) models are another powerful 
modification of RNN models. In contrast to LSTMs, they 
are less complex, making training more efficient. 


Specification. Similar to LSTM models, GRU models replace
each hidden unit $h_t$ by a GRU cell unit with additional
gating parameters. GRUs use update and reset gates, deciding
what information should be passed to the output. The
update equations of a GRU are as follows:

$$
\begin{aligned}
z_t         &= \sigma(W_{zx} x_t + W_{zh} h_{t-1} + b_z) && (7) \\
r_t         &= \sigma(W_{rx} x_t + W_{rh} h_{t-1} + b_r) && (8) \\
\tilde{h}_t &= \tanh(W_{hx} x_t + r_t \times W_{hh} h_{t-1} + b_h) && (9) \\
h_t         &= z_t \times h_{t-1} + (1 - z_t) \times \tilde{h}_t && (10)
\end{aligned}
$$

$z_t$ and $r_t$ represent the update and reset gate of the GRU
cell unit. $\tilde{h}_t$ denotes an intermediate candidate hidden state.
The different weight matrices are described by $W$ and $b$
stands for the bias.
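A single GRU update following Eqs. (7)-(10) can be sketched in plain Python. The parameter dictionary layout, the helper names, and the reading of Eq. (9) as the reset gate scaling the product of the recurrent weights and $h_{t-1}$ element-wise are assumptions made for this illustration. Deep learning libraries may order these operations differently.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x_t, h_prev, p):
    """One GRU update; `p` maps hypothetical names to weights and biases."""
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_zx"], x_t), matvec(p["W_zh"], h_prev), p["b_z"])]
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_rx"], x_t), matvec(p["W_rh"], h_prev), p["b_r"])]
    # Candidate state: the reset gate scales the recurrent contribution.
    h_cand = [math.tanh(a + r_i * b + c) for a, r_i, b, c in
              zip(matvec(p["W_hx"], x_t), r, matvec(p["W_hh"], h_prev), p["b_h"])]
    # Interpolate between the previous state and the candidate state.
    return [z_i * h_i + (1.0 - z_i) * c_i
            for z_i, h_i, c_i in zip(z, h_prev, h_cand)]


def zero_params(d_in, d_h):
    """All-zero parameters, used only to make the example checkable by hand."""
    def zeros(rows, cols):
        return [[0.0] * cols for _ in range(rows)]
    p = {}
    for gate in "zrh":
        p[f"W_{gate}x"] = zeros(d_h, d_in)
        p[f"W_{gate}h"] = zeros(d_h, d_h)
        p[f"b_{gate}"] = [0.0] * d_h
    return p


h = gru_step([2.0, 3.0, 1.0], [1.0, -2.0], zero_params(d_in=3, d_h=2))
print(h)
```

With all-zero parameters both gates equal 0.5 and the candidate state is 0, so the update halves the previous hidden state. This makes the interpolation in Eq. (10) easy to verify by hand.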


Modeling. Just as for the LSTM models, the clustering
features at each time step $t$ are represented by the input
vector $x_t$, i.e., $x_t = [NC_t, NE_t, NSE_t]$. The input sequence
of an example student is given in Fig. 4. We also use the
exact same description for the output layer of the GRU:
the output layer of the model uses the softmax function to
normalize the vectors to sum to 1 such that the values within
the output vectors $\hat{y}_t$ can be interpreted as probabilities for
the different cluster labels (see Fig. 4). We again train a
model on the whole output sequences of the students that is able
to predict the cluster label at each time step $t$. We denote
this model with GRU$_{\text{seq}}$. We also train one GRU model per
level, where we compute the loss of the model only for the
last output $\hat{y}_T$ of the sequence (marked with red in Fig. 4).
We denote this model with GRU$_{\text{end}}$. Finally, similar to
LSTM models, we can also stack GRU models on top of
each other. We will call models with $n$ hidden layers, where
$n > 1$, GRU$_{\text{seq},n}$ or GRU$_{\text{end},n}$.


4. EXPERIMENTAL EVALUATION 

We evaluated the predictive accuracy of the variations of 
RNN models on the data set described in Section 2. We also 
compared them to the following popular traditional classi- 
fiers using the same data set: k-Nearest-Neighbor (kNN) and 
random forests (RF). While these classifiers are well tested 
and efficient to train, they require feature vectors of the same
length for each student and therefore need to be trained for
fixed points in time. RNNs on the other hand represent the 
relevant information over time, possibly enabling a more ac- 
curate classification of students early in the game. 
Figure 5: Accuracy (top) and AUC (bottom) of the LSTM$_{\text{seq}}$, LSTM$_{\text{seq},2}$, and GRU$_{\text{seq}}$ models by achieved level. Both
measures increase up to level 6 and stagnate or deteriorate afterward. This is probably due to the fact that the models are
trained to predict the cluster label after each time step $t$, i.e., training loss is optimized over the whole sequence of observations.


Experimental Setup. We applied a train-test setting, i.e. 
parameters were fit on the training data set and performance 
of the methods was evaluated on the test data set. Predic- 
tive performance was evaluated using the accuracy as well as 
the micro-averaged area under the ROC curve (AUC). The 
accuracy is a measure that can be interpreted easily. The 
cluster solutions for both data sets are not balanced. We 
used the AUC as an additional performance measure as it is 
robust to class imbalance. 
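Both performance measures can be computed directly from the predicted class probabilities. The sketch below is an assumption about how the micro-average is formed, following its usual definition: the one-hot label matrix and the probability matrix are flattened and a single binary AUC is computed over all (student, class) pairs. The function names and the toy data are hypothetical.

```python
def accuracy(y_true, y_prob):
    """Fraction of samples whose most probable class equals the true class."""
    hits = sum(1 for t, p in zip(y_true, y_prob)
               if max(range(len(p)), key=p.__getitem__) == t)
    return hits / len(y_true)


def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def micro_auc(y_true, y_prob):
    """Micro-averaged AUC: pool all (sample, class) pairs into one binary problem."""
    n_classes = len(y_prob[0])
    labels = [1 if k == t else 0 for t in y_true for k in range(n_classes)]
    scores = [s for p in y_prob for s in p]
    return binary_auc(labels, scores)


# Toy example with three classes (class indices are zero-based here).
y_true = [0, 1, 2]
y_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.8, 0.1],
          [0.3, 0.4, 0.3]]
print(accuracy(y_true, y_prob), micro_auc(y_true, y_prob))
```

The pairwise AUC computation is quadratic in the number of pooled scores, which is acceptable for a data set of a few hundred students. A rank-based implementation would be preferable for larger data.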

For all methods, we used a student-stratified (i.e. dividing 
the folds by students) 10-fold cross validation. Within each 
fold f, we re-clustered the students of the training data set 
of f to obtain the output features y, i.e. the cluster labels, 
for training. We purposely did not use the original cluster la- 
bels from the solution found on the whole data set (including 
training and test data) for training, to prevent dependencies
on the cluster labels of the test data set. The average cluster
stability [22] between the 10 different training data sets and 
the original cluster solution was 0.87, i.e., on average 87% 
of the samples received the same label on the training data 
set as on the original data set. Therefore, 0.87 constitutes 
an upper bound for the accuracy of the classifiers. 
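Dividing the folds by students means that all observations of one student end up on the same side of each train-test split. A minimal sketch of such a split is given below. The function name, the shuffling seed, and the round-robin assignment are illustrative assumptions, and the re-clustering of each training set is not shown.

```python
import random


def student_folds(student_ids, n_folds=10, seed=0):
    """Yield (train_ids, test_ids) pairs in which each student is tested exactly once."""
    ids = sorted(set(student_ids))
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for i, test_ids in enumerate(folds):
        train_ids = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train_ids, test_ids


# 229 hypothetical student identifiers, matching the size of the data set.
ids = [f"s{i}" for i in range(229)]
splits = list(student_folds(ids))
print([len(test) for _, test in splits])
```

All sequences of the students in `test_ids` would then be held out together, so no student contributes observations to both the training and the test data of a fold.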

All the RNN models were implemented using Keras [7] with 
Theano [33] as backend. Categorical crossentropy was used 
to calculate loss and ADAM was used as the optimizer. The 
models were trained for e = 100 epochs. For all types of 
RNN models, we used post-padding and masking to account 
for the different sequence lengths. 
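Post-padding and masking can be illustrated without any deep learning library. The sketch below appends padding vectors after the end of each shorter sequence and returns a mask marking the real time steps. The helper name `post_pad` and the pad value of 0 are assumptions made for this example.

```python
def post_pad(sequences, pad_value=0.0):
    """Pad sequences of feature vectors at the end; return padded data and a 0/1 mask."""
    max_len = max(len(s) for s in sequences)
    n_feat = len(sequences[0][0])
    padded, mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        padded.append([list(x) for x in s]
                      + [[pad_value] * n_feat for _ in range(n_pad)])
        mask.append([1] * len(s) + [0] * n_pad)   # 1 = real step, 0 = padding
    return padded, mask


# Two hypothetical students with 2 and 3 observations.
seqs = [[[0, 1, 0], [0, 2, 0]],
        [[1, 0, 0], [2, 0, 0], [3, 0, 0]]]
padded, mask = post_pad(seqs)
print(padded[0], mask[0])
```

With the cumulative count features used here, every real observation increases at least one count, so an all-zero vector never occurs as a genuine time step and can serve as the padding value without ambiguity.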

For the traditional classifiers, we trained one model for each
level $n$ of the game. The input vector $x_n$ of each model
therefore encodes the clustering features exactly at level $n$,
i.e., $x_n = [NC_n, NE_n, NSE_n]$.

We determined the optimal number $k_o$ of nearest neighbors
for the kNN classifier as follows: within each fold $f$, we randomly
put 10% of the students from the training data set
in a validation data set and selected the number of nearest
neighbors $k_{o,f}$ yielding the best performance in terms of accuracy
on this validation data set. We then predicted the
cluster labels of the test data set using the labels of the $k_{o,f}$
nearest neighbors.
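The selection of the number of neighbors on a held-out validation split can be sketched as follows. The candidate values for $k$, the majority vote, the Euclidean distance, and the function names are assumptions made for the illustration. The source only specifies the 10% validation split and the accuracy criterion.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Majority label among the k training points closest to x (Euclidean distance)."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def select_k(xs, ys, candidates=(1, 3, 5), val_fraction=0.1, seed=0):
    """Pick the candidate k with the best accuracy on a random validation split."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_val = max(1, round(val_fraction * len(idx)))
    val, fit = idx[:n_val], idx[n_val:]
    fit_x = [xs[i] for i in fit]
    fit_y = [ys[i] for i in fit]

    def val_accuracy(k):
        hits = sum(knn_predict(fit_x, fit_y, xs[i], k) == ys[i] for i in val)
        return hits / n_val

    # max() keeps the first candidate in case of ties, i.e. the smallest k.
    return max(candidates, key=val_accuracy)


# Hypothetical level features [NC_n, NE_n, NSE_n] for two well-separated groups.
xs = [[i % 3, 0, 0] for i in range(10)] + [[10 + i % 3, 5, 5] for i in range(10)]
ys = [0] * 10 + [1] * 10
k_best = select_k(xs, ys)
print(k_best, knn_predict(xs, ys, [9, 5, 5], k_best))
```

In the evaluation setting described above, this selection would be repeated inside every cross-validation fold, using only that fold's training students.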

For the RF method, we trained $B = 100$ binary decision
trees using bootstrapping with re-sampling ($r_b = 1.5 \cdot M_f$,
with $M_f$ being the number of samples in the training data
set of fold $f$).


RNN Models returning a sequence of outputs. We
varied the parameters of our RNN models outputting the
whole sequence of cluster labels along three dimensions: the
structure of the hidden layer(s) (LSTM or GRU), the number
of hidden layers, and the dimension of the hidden state
within a hidden layer. Specifically, we computed the predictive
performance for the following models: a 1-layer LSTM
(LSTM$_{\text{seq}}$) with a $d_h$-dimensional hidden state where
$d_h \in \{4, 8, 16, 32\}$, a 2-layer LSTM (LSTM$_{\text{seq},2}$) with a
$d_h$-dimensional hidden state per layer where $d_h \in \{2, 4, 8, 16\}$, and a
1-layer GRU (GRU$_{\text{seq}}$) with a $d_h$-dimensional hidden state
where $d_h \in \{4, 8, 16, 32\}$. Figure 5 illustrates the predictive
performance in terms of accuracy and AUC for the three
different architectures by achieved level.

Figure 6: Categorical crossentropy of the LSTM$_{\text{seq}}$, LSTM$_{\text{seq},2}$, and GRU$_{\text{seq}}$ models by the number of dimensions of the
hidden state. The average test error begins to increase when using more than $d_h = 16$ hidden dimensions, while the training

The LSTM$_{\text{seq}}$ models reach a poor accuracy when $d_h = 4$.
There is also not much difference between the accuracy at
level 1 (0.42) and at level 8 (0.49). The accuracy improves
substantially when increasing the dimension of the hidden
state to $d_h = 8$, $d_h = 16$, or $d_h = 32$. We especially observe a
jump in accuracy between levels 3 (e.g., Accuracy$_{8}$ = 0.45)
and 4 (e.g., Accuracy$_{8}$ = 0.58). For $d_h = 32$, there is a
second jump between levels 5 (Accuracy$_{32}$ = 0.59) and 6
(Accuracy$_{32}$ = 0.64). These jumps in accuracy correspond
to jumps in the difficulty of the Challenge questions: the
eight problems consist of one very easy question followed by
two easy questions, two medium questions, and three difficult
questions. There is no increase in accuracy between
levels 1 and 3 as most students passed these easy levels very
quickly. We observe a similar picture for the AUC. The AUC
increases with the number of dimensions $d_h$ of the hidden
state. Note that the AUC again jumps between levels 3 (e.g.,
AUC$_{8}$ = 0.72) and 4 (e.g., AUC$_{8}$ = 0.81).

For the LSTM$_{\text{seq},2}$ models, predictive performance again increases
with the number of dimensions of the hidden state
within a layer. For this type of model, using a 2-dimensional
hidden state or a 4-dimensional hidden state per layer leads
to a low accuracy. Increasing to an 8-dimensional hidden
state or a 16-dimensional hidden state per layer yields a
large improvement in accuracy and these models also show a
jump in accuracy between level 3 (e.g., Accuracy$_{2 \times 16}$ = 0.48)
and level 4 (e.g., Accuracy$_{2 \times 16}$ = 0.63). The AUC also increases
with an increasing number of dimensions $d_h$ per hidden
layer. The accuracy of the LSTM$_{\text{seq},2}$ models is in line with
the accuracy of the LSTM$_{\text{seq}}$ models for $d_h = 16$ / $d_h = 2 \times 8$
hidden dimensions as well as for $d_h = 32$ / $d_h = 2 \times 16$ hidden
dimensions. However, the LSTM$_{\text{seq}}$ models show a
higher AUC than the LSTM$_{\text{seq},2}$ models, e.g., for $d_h = 16$ /
$d_h = 2 \times 8$ hidden dimensions (for example at level 5:
AUC$_{\text{LSTM}_{\text{seq}}}$ = 0.87, AUC$_{\text{LSTM}_{\text{seq},2}}$ = 0.79).

Performance of the GRU$_{\text{seq}}$ models in terms of accuracy
shows the same trends over time as for the LSTM$_{\text{seq}}$ models.
Again, employing a 4-dimensional hidden state results in
a low accuracy. When increasing the number of dimensions
of the hidden state to $d_h = 8$, $d_h = 16$, or $d_h = 32$, the models
are able to capture the jump in difficulty between level
3 (e.g., Accuracy$_{8}$ = 0.43) and level 4 (e.g., Accuracy$_{8}$ =
0.54). The architecture employing a 16-dimensional hidden
state also shows a jump in accuracy between level 4
(Accuracy$_{16}$ = 0.59) and level 6 (Accuracy$_{16}$ = 0.66). The
AUC again increases with an increasing number of dimensions
of the hidden state. Applying $d_h = 32$ instead of
$d_h = 16$ hidden dimensions generally does not increase the
AUC, and is even worse for some levels, e.g., for level 6
(AUC$_{16}$ = 0.86, AUC$_{32}$ = 0.81). Generally, performance is
in line with the performance of the LSTM$_{\text{seq}}$ models in
terms of accuracy for $d_h = 16$ and $d_h = 32$ hidden dimensions.
Again, the AUC for the LSTM$_{\text{seq}}$ models tends to
be higher than the AUC for the GRU$_{\text{seq}}$ models, for example
for $d_h = 16$ hidden dimensions at the peak level 6
(AUC$_{\text{LSTM}_{\text{seq}}}$ = 0.90, AUC$_{\text{GRU}_{\text{seq}}}$ = 0.86).

The performance increase of all models with a higher number
of hidden dimensions is as expected. However, the danger
of overfitting increases with a higher number of parameters.
While our training and evaluation methods have measures
against overfitting, such as cross-validation, in place, we
investigated the relation between the average training and test
error of the different configurations and the number of dimensions
$d_h$ of the hidden state. Figure 6 illustrates the average
categorical crossentropy on the training data sets and
test data sets. We observe that the difference between training
error and test error grows with an increased
number of hidden dimensions. Specifically, there is a kink
in the test error at $d_h = 16$ / $d_h = 2 \times 8$ hidden dimensions for
all models. We therefore conclude that $d_h = 16$ / $d_h = 2 \times 8$ is
the maximum number of hidden dimensions that should be
used for this data set.


RNN Models returning the last output in the sequence.
We also trained a range of RNN models returning
only the last output in the sequence, i.e., predicting the
cluster label at the end of a given sequence. We varied the
parameters of these models along the same dimensions as
for the models predicting the whole sequence. Specifically,
we computed the predictive performance for the following
models: a 1-layer LSTM (LSTM$_{\text{end}}$) with a $d_h$-dimensional
hidden state where $d_h \in \{4, 8, 16, 32\}$, a 2-layer LSTM
(LSTM$_{\text{end},2}$) with a $d_h$-dimensional hidden state per layer
where $d_h \in \{2, 4, 8, 16\}$, and a 1-layer GRU (GRU$_{\text{end}}$) with
a $d_h$-dimensional hidden state where $d_h \in \{4, 8, 16, 32\}$. We
trained one model for each level of the game. The predictive
performance of all the models in terms of accuracy and AUC
is illustrated in Fig. 7.

Figure 7: Accuracy (top) and AUC (bottom) of the LSTM$_{\text{end}}$, LSTM$_{\text{end},2}$, and GRU$_{\text{end}}$ models by achieved level. Both measures
increase over time and all models achieve a similar predictive performance for the higher numbers of hidden dimensions.

Similar to the models predicting sequences, the LSTM$_{\text{end}}$
model achieves the lowest accuracy with $d_h = 4$. However,
up to level 6, there is no big difference in accuracy
between this model and the models with $d_h > 8$ hidden dimensions.
All the LSTM$_{\text{end}}$ models capture the jump in difficulty
between level 3 and level 4. The models with higher-dimensional
hidden states also capture the second jump happening
after level 5. For $d_h = 16$, we for example see a jump
in accuracy after level 3 (level 3: Accuracy = 0.46, level
4: Accuracy = 0.62) and a second jump after level 6 (level
6: Accuracy = 0.67, level 7: Accuracy = 0.77). Regarding
the AUC, we observe superior performance of the models
with $d_h = 16$ and $d_h = 32$ hidden dimensions after level 4.
For these models, the AUC constantly increases until level 6
(AUC$_{16}$ = 0.89, AUC$_{32}$ = 0.88). As seen before, we do not
observe any improvement in performance after level 6.

For the LSTM$_{\text{end},2}$ model, employing $d_h = 2$ hidden dimensions
per layer leads to the lowest achieved accuracy and
there is also not much improvement over time. The models
with $d_h > 2$ capture the increased difficulty after level 3
(e.g., level 3: Accuracy = 0.47, level 4: Accuracy =
0.57). Only the models using $d_h = 8$ or $d_h = 16$ hidden
dimensions per layer manage to capture the second jump
in difficulty (e.g., level 6: Accuracy$_{2 \times 16}$ = 0.69, level 7:
Accuracy$_{2 \times 16}$ = 0.81). We observe a similar picture for
the AUC: when using $d_h = 8$ or $d_h = 16$ hidden
dimensions, there is a strong increase in AUC after level
3 (e.g., level 3: AUC$_{2 \times 8}$ = 0.74, level 4: AUC$_{2 \times 8}$ = 0.81)
and after level 6 (e.g., level 6: AUC$_{2 \times 16}$ = 0.87, level 7:
AUC$_{2 \times 16}$ = 0.95).

The accuracy plot of the GRU$_{\text{end}}$ models looks similar to
the accuracy plot of the LSTM$_{\text{end}}$ models. Up to level 4,
all models perform similarly (at level 4: Accuracy$_{4}$ = 0.58,
Accuracy$_{8}$ = 0.62, Accuracy$_{16}$ = 0.61, Accuracy$_{32}$ = 0.55).
For the higher levels, the model with $d_h = 4$ hidden dimensions
shows the lowest accuracy. Using a model with
an 8-dimensional hidden state nicely captures the jumps
in accuracy between level 3 (Accuracy = 0.47) and level
4 (Accuracy = 0.62) and between level 6 (Accuracy = 0.62)
and level 7 (Accuracy = 0.73). Increasing the number of hidden
dimensions to $d_h = 16$ improves performance only from
level 5 on (e.g., at level 6: Accuracy$_{8}$ = 0.62, Accuracy$_{16}$ =
0.69). This model also captures the two jumps in accuracy.
Using $d_h = 32$ hidden dimensions does not lead to any further
improvements. The model employing a 4-dimensional
hidden state performs worst for the AUC, with the exception
of level 3. When using $d_h = 16$ or $d_h = 32$ hidden dimensions,
the AUC again models the two jumps in difficulty
(e.g., level 3: AUC$_{32}$ = 0.72, level 4: AUC$_{32}$ = 0.82, level 5:
AUC$_{32}$ = 0.85, level 6: AUC$_{32}$ = 0.92). It also seems that
using a higher number of hidden dimensions, i.e., $d_h > 8$,
increases the stability of the AUC. With the exception of the
drop at level 3, the AUC of the model with $d_h = 16$ hidden
dimensions nicely increases over time.

We again tested for overfitting by comparing the average
training and test loss of the different models. Just as for the
models trained to predict the cluster label at each time step,
we found that there is a kink at d_h = 16 hidden dimensions:
while the training error still decreases for larger values of
d_h, the error on the test set increases. We therefore conclude
that d_h = 16 is the maximum number of hidden dimensions
that can be used without overfitting.
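
The overfitting check described above can be sketched as follows. This is a minimal illustration and not the procedure used in the paper: the function name `find_kink` and all loss values are hypothetical, chosen only to reproduce the described pattern in which the training loss keeps falling while the test loss starts to rise.

```python
def find_kink(dims, train_loss, test_loss):
    """Return the largest hidden dimension before the test loss rises
    while the training loss is still falling (hypothetical helper)."""
    for i in range(1, len(dims)):
        train_improves = train_loss[i] < train_loss[i - 1]
        test_worsens = test_loss[i] > test_loss[i - 1]
        if train_improves and test_worsens:
            return dims[i - 1]
    # No kink found: the largest candidate is still acceptable.
    return dims[-1]


# Illustrative values only; these are NOT results from the paper.
dims = [4, 8, 16, 32]
train_loss = [1.30, 1.10, 0.95, 0.80]
test_loss = [1.35, 1.20, 1.10, 1.18]

chosen = find_kink(dims, train_loss, test_loss)
print(chosen)
```

With these made-up numbers, the test loss first increases when moving from 16 to 32 hidden dimensions, so the sketch selects 16.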


‘Sequence versus End’. When comparing the RNN mod- 
els trained for sequence prediction to the models trained 
to predict only the last output of the sequence, we observe 
that the performance plots show the same overall trends (see 
Fig. 5 and Fig. 7). For both types, predictive performance 
in terms of accuracy is similar for the 1-layer LSTM
and the 1-layer GRU models. The GRU_Seq model generally
has a lower accuracy than the LSTM_Seq model when using
d_h < 16. The GRU_End model exhibits a lower accuracy than
the LSTM_End model only for d_h = 4. The models with two
stacked layers, i.e. the 2-layer LSTM models, generally show
a lower accuracy when employing a low number of hidden
dimensions per layer (2x2 or 2x4).

For both the ‘sequence’ and the ‘end’ RNN models, all three
model types achieve similar accuracies for d_h = 8 or d_h = 16
hidden dimensions. All RNN models show no improvement
or even a drop in AUC after level 6. This effect is more
pronounced for the models which are trained on the sequence, as
their loss is optimized over the whole sequence. We further
hypothesize that the sequences at the higher levels might be
too long for the RNN models to capture the relevant
information, because even with the LSTM architecture, RNNs
tend to struggle with very long data sequences.
The main difference between the ‘sequence’ and the ‘end’
models is that the accuracy of the ‘end’ models increases more
strongly with increasing levels. For example, for the LSTM_Seq
model with d_h = 16 the accuracy at level 1 is 0.42 and the
accuracy at level 7 is 0.63, while the accuracy of the LSTM_End
model with d_h = 16 is 0.41 at level 1 and 0.77 at level 7.
Because the ‘end’ models are optimized to predict the last
output of a sequence, they reach a higher accuracy at the end
of the game (e.g. at level 8: Accuracy_{GRU_Seq,16} = 0.61,
Accuracy_{GRU_End,16} = 0.77). The ‘sequence’ RNN models are
optimized for average performance over the sequence and
therefore exhibit less variance over time and smoother accuracy
and AUC curves. For the medium levels of the game,
‘sequence’ and ‘end’ RNN models perform similarly in terms of
accuracy and AUC.
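
The difference between the two training objectives can be illustrated with a small sketch. It assumes a cross-entropy loss, which this section does not specify, and the function names and probability values are hypothetical. The ‘sequence’ objective averages the loss over all time steps, whereas the ‘end’ objective only looks at the last time step.

```python
import math


def sequence_loss(p_true):
    """Mean cross-entropy over all time steps ('sequence' objective)."""
    return sum(-math.log(p) for p in p_true) / len(p_true)


def end_loss(p_true):
    """Cross-entropy of the last time step only ('end' objective)."""
    return -math.log(p_true[-1])


# Hypothetical probabilities that a model assigns to the correct cluster
# label after each time step (illustrative values, not from the paper).
p_true = [0.2, 0.3, 0.5, 0.8]

print(round(sequence_loss(p_true), 3))
print(round(end_loss(p_true), 3))
```

Under this assumption, a model trained on the ‘sequence’ objective is also penalized for uncertain early predictions, while a model trained on the ‘end’ objective is free to focus on the final prediction.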

All configurations exhibit a satisfactory accuracy and a
medium-high AUC for d_h > 8. For comparison, a random
classifier would achieve an accuracy of 0.17, and a classifier
always predicting the majority label would achieve an accuracy
of 0.32. The AUC of these two classifiers would be 0.5.
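
The two baseline accuracies can be reproduced with a short sketch. It rests on two assumptions that are not stated in this section: that there are six cluster labels (consistent with 1/6 ≈ 0.17) and that the largest cluster contains 32% of the students. The cluster counts below are hypothetical and only chosen to match that share.

```python
def random_baseline(n_classes):
    """Expected accuracy of a classifier guessing uniformly at random."""
    return 1.0 / n_classes


def majority_baseline(class_counts):
    """Accuracy of a classifier that always predicts the largest class."""
    return max(class_counts) / sum(class_counts)


# Assumption: six clusters. The counts are hypothetical and not taken
# from the paper; only the 32% share of the largest cluster matters.
counts = [32, 20, 16, 14, 10, 8]

print(round(random_baseline(len(counts)), 2))
print(round(majority_baseline(counts), 2))
```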


Comparison to traditional classifiers. We compared
the predictive performance of selected RNN models to the
predictive performance of the traditional classifiers. Because
the RNN models show an increased performance with a
higher number of hidden dimensions, we used RNN models
with d_h = 16 for the comparison, i.e. the maximum number
of hidden dimensions not resulting in overfitting. In case of
the ‘sequence’ models, the LSTM_Seq model achieved a similar
accuracy, but a higher AUC than the two other model types
for d_h = 16. We therefore selected the LSTM_Seq model with
d_h = 16 for comparison. In case of the ‘end’ models, the
1-layer LSTM and 1-layer GRU models achieved similar results;
however, the LSTM models were better at predicting the jumps
in difficulty. We therefore selected the LSTM_End model with
d_h = 16 for comparison. Figure 8 displays the accuracy and
the AUC of the kNN classifier, the RF method, a 1-layer
LSTM model for sequence prediction with d_h = 16 hidden
dimensions (LSTM_Seq), and a 1-layer LSTM model for
predicting the last output of the sequence with d_h = 16 hidden
dimensions (LSTM_End).

Figure 8: Accuracy (top) and AUC (bottom) of kNN, RF,
LSTM_Seq, and LSTM_End by achieved level. The RNNs
achieve similar or better performance than the traditional
classifiers up to level 6.

We observe that the RNN models outperform the traditional
classifiers for the first three levels regarding the accuracy
(e.g., at level 3: Accuracy_kNN = 0.38, Accuracy_RF = 0.43,
Accuracy_{LSTM_Seq} = 0.45, Accuracy_{LSTM_End} = 0.46). The
same holds for the middle of the game, i.e. levels 4 to 6
(e.g., at level 5: Accuracy_kNN = 0.57, Accuracy_RF = 0.63,
Accuracy_{LSTM_Seq} = 0.63, Accuracy_{LSTM_End} = 0.67). At
the end of the game, the accuracy of the traditional
classifiers is close to the stability of the clustering
(Accuracy_kNN = 0.85, Accuracy_RF = 0.83).

For the RF approach and the two RNN models, we also
computed the AUC (see Fig. 8 (bottom)). In contrast to
the other three methods, which output probabilities for each
cluster label, kNN just outputs the predicted cluster label,
and we therefore did not compute the AUC for this method.
We observe that the LSTM_Seq and the LSTM_End models
clearly outperform the RF method at the beginning of the
game (e.g., at level 3: AUC_RF = 0.71, AUC_{LSTM_Seq} = 0.75,
AUC_{LSTM_End} = 0.73). Also in the middle of the game
between levels 4 and 6, the RNN models show a higher AUC
than the RF method (e.g., at level 5: AUC_RF = 0.82,
AUC_{LSTM_Seq} = 0.89, AUC_{LSTM_End} = 0.87) with the
exception of level 4. When looking at the whole sequence, we
observe a similar picture as for the accuracy: the AUC of
the RF method is clearly higher than the AUC of the RNN
models (at level 8: AUC_RF = 0.91, AUC_{LSTM_Seq} = 0.80,
AUC_{LSTM_End} = 0.85).
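
Why the AUC needs graded scores rather than hard labels can be seen in a small sketch. It computes a binary AUC (one cluster label versus the rest) as the fraction of positive/negative pairs that are ranked correctly, counting ties as one half. How the AUC is aggregated over the cluster labels is not stated in this section, so the helper `binary_auc` and the scores below are purely illustrative assumptions.

```python
def binary_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical one-vs-rest labels and predicted probabilities.
labels = [1, 0, 0, 1]
probabilities = [0.9, 0.8, 0.3, 0.6]

# A classifier giving every student the same score cannot rank them.
constant_scores = [0.5, 0.5, 0.5, 0.5]

print(binary_auc(probabilities, labels))    # 0.75
print(binary_auc(constant_scores, labels))  # 0.5
```

The second call shows why a classifier that cannot rank students, such as one that always predicts the majority label, ends up at an AUC of 0.5.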

As mentioned before, we assume that the worse performance
of the RNN models at the end of the game (level
7 and level 8) is due to the fact that the input sequences
for the RNN models become too long. Note that while most
students manage to reach level 5 within a reasonable time 
frame, the lengths of the complete sequences vary signifi- 
cantly between the students. The lower accuracy and AUC 
of the RNN models at the end of the game are not an is- 
sue in our case because we are interested in accurate pre- 
dictions early in the game. While the RNN models show 
only a slightly increased accuracy in comparison to the tra- 
ditional methods at the beginning and in the middle of the 
game, they consistently achieve a higher AUC up to level 6, 
demonstrating their robustness towards class imbalance. 


5. DISCUSSION 


OELEs constitute a promising approach for learning. Ideally,
the students learn the concepts and principles of a domain
more deeply through exploration than if they are simply
taught the principles and practice applying them. However,
it has been shown that only some of the students are able
to effectively explore the space [20, 31, 24, 17]. Providing
guidance to struggling students is therefore essential for ed- 
ucational success. 

Because OELEs allow the user to freely interact with the 
content, traditional student modeling approaches cannot di- 
rectly be applied to provide adaptation to the student. Adap- 
tation based on detected student behavior and strategies is 
therefore a promising approach. Previous work has used of- 
fline clustering to detect different student types, followed by 
online classification of new students [15, 16, 11]. 


In this paper, we focused on the task of online classifica- 
tion, i.e., predicting the student type (or behavior) early on 
during interaction with the environment to provide targeted 
guidance as early as possible. In contrast to previous work 
applying standard classifiers such as k-nearest-neighbor [15, 
16, 11], we suggested the use of RNN models for the online 
classification task. While previous research has investigated 
the use of RNN models to classify the students according 
to their problem-solving behavior based on their whole se- 
quence of interactions [1], to classify changing learner behav- 
iors over time [26], or to detect affective states over time [4], 
we explored the possibility of using RNN models to predict a 
student’s cluster label (fixed over time) as early as possible. 
We have extensively evaluated a variety of RNN models and 
compared their predictive performance to the performance 
of k-nearest neighbor and random forest classifiers. We have 
used the different levels of the game as specific time points 
for evaluation as they pose realistic time points for interven- 
tions. We have trained RNN models to predict the cluster 
label at each time step as well as RNN models optimized for 
predicting the cluster label at each level of the game. 

Not unexpectedly, the RNN models trained per level as well
as the traditional classifiers outperform the models trained
for predicting the whole sequence at the higher levels (levels 7
and 8) of the game. This is due to the averaging effect on the
performance of the ‘sequence’ RNN models: during training,
the loss is computed for each time step of the sequence.
Nevertheless, for levels 4 and 5, which provide promising time
points of intervention both in terms of the accuracy of the
different models and in terms of the timing of the intervention,
the LSTM_Seq model with 16 hidden dimensions reaches a
performance similar to that of the other approaches. While this
model does not outperform traditional approaches regarding
prediction accuracy, it provides the potential for further
adapting the intervention. As the model is able to predict the
cluster label at each time step, it is possible to provide the
intervention at different points in time for different students
depending on how sure the model is about the cluster label of
the student. For example, [21] used a simple heuristic to decide
at each time step whether the model should continue to see
further time steps before outputting a final decision. While
exhibiting the same accuracy, classification happened on
average at an earlier point in time. This earlier classification
made it possible to provide targeted interventions sooner.
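
One way such confidence-dependent timing could look is sketched below. This is not the heuristic of [21], whose details are not given here, but a simple threshold rule chosen for illustration: the classification is released at the first time step at which the highest predicted probability reaches a threshold. The function name, the threshold of 0.8, and the probabilities are all hypothetical.

```python
def early_decision(prob_steps, threshold=0.8):
    """Return (time_step, label) for the first time step whose highest
    class probability reaches the threshold; if no step is confident
    enough, fall back to the prediction at the last time step."""
    for t, probs in enumerate(prob_steps):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            return t, label
    last = prob_steps[-1]
    return len(prob_steps) - 1, max(range(len(last)), key=last.__getitem__)


# Hypothetical per-time-step class probabilities for one student
# (three cluster labels, four time steps; not data from the paper).
prob_steps = [
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
    [0.85, 0.10, 0.05],
    [0.90, 0.05, 0.05],
]

print(early_decision(prob_steps))  # (2, 0)
```

In this example the model commits to cluster label 0 at the third time step instead of waiting for the full sequence. A higher threshold delays the decision, and a lower one trades confidence for earlier intervention.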

While the LSTM_End model with 16 hidden dimensions is
also outperformed by the traditional classifiers at levels 7
and 8 of the game, it shows a higher prediction accuracy
than the kNN and RF classifiers for the first levels. Our
results further demonstrate that the LSTM_End model with
16 hidden dimensions outperforms the RF classifier regarding
the AUC up to level 6. This is especially important, because
the AUC is not biased by imbalanced data sets.

We have also investigated different architectures for the RNN 
models. Our results demonstrate no large difference in the 
performance of LSTM and GRU models. However, due to
their lower complexity, GRU models are more efficient and
take less time to train than the LSTM models. While this was
not an issue for our small data set, it should be considered 
when training on larger data sets. 

Due to the relatively small number of samples, we were only
able to train shallow models with one or two hidden layers,
not fully exploiting the advantages of deep learning. Fur- 
thermore, we also had to keep the number of dimensions of 
the hidden state low. However, the results achieved on our 
small data set are promising and we assume that the RNN 
models would perform even better on larger data sets. 


In the future, we plan to design and test targeted interven- 
tions for the different clusters. Furthermore, we will collect 
a substantially larger data set to enable the training of deep 
neural networks using raw feature input only. Finally, we 
also plan to design and train the classifier such that scaf- 
folded interventions can be delivered. 